ABSTRACT

This report examines a series of general models that
represent the process of merging records from separate files when it
becomes essential to inhibit identifiability of records in at least
one of the files. Models are illustrated symbolically by flow
diagrams, and examples of each variation are taken from the social
sciences. These variations cover simple situations, such as that of
soliciting anonymous data from previously identified respondents, as
well as more complex merge operations, e.g., merging files from
mutually insulated data banks and merging data under code linkage
systems. Characteristics of the models are discussed with emphasis on
their benefits and disadvantages. (Author)
Insert Figure 2 about here



Agency A, controlling data file AI, initiates action to merge its data
with Agency B, controlling file BI. In order to conduct the merge operation,
the first agency creates file A'I, containing complete identification
of each individual and a cryptographically encoded record (A') of each
individual's attributes (e.g., academic performance, personality
characteristics, etc.). Encoding is based on a computing algorithm which
must be unavailable to Agency B.

File A'I is then transmitted to Agency B for merging with
BI by the agency. The merging is based on the common identification included
in files A'I and BI. As a record from one file is matched and merged with
a corresponding record from the second file, Agency B deletes the identifier
in both records.

As a result of the merge and delete operations, the file labeled A'B
is produced. Each encoded statistical record from file A' is associated
with the proper statistical record from file B, and the records are
virtually anonymous.

File A'B is returned to Agency A, which then decodes the records
previously encoded. The decoded statistical file AB is then ready for editing
and analysis by Agency A.
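
The flow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XOR "cipher", the key, the record layout and every function name are hypothetical stand-ins that do not appear in the original model, and a toy cipher of this kind would not be adequate in practice.

```python
# Illustrative sketch of the Model 2.0 flow; all names and the toy cipher
# are assumptions made for this example.
from itertools import cycle


def encode(value: str, key: bytes) -> str:
    """Agency A's toy XOR encoding (a stand-in for a real cipher)."""
    data = value.encode("utf-8")
    return bytes(b ^ k for b, k in zip(data, cycle(key))).hex()


def decode(token: str, key: bytes) -> str:
    """Inverse of encode(); only the key holder can apply it."""
    data = bytes.fromhex(token)
    return bytes(b ^ k for b, k in zip(data, cycle(key))).decode("utf-8")


def agency_a_prepare(file_ai: dict, key: bytes) -> dict:
    """AI -> A'I: identifiers stay in the clear, attributes are encoded."""
    return {ident: encode(attr, key) for ident, attr in file_ai.items()}


def agency_b_merge(file_a_enc_i: dict, file_bi: dict) -> list:
    """A'I + BI -> A'B: match on the common identifier, then drop it."""
    return [(a_enc, file_bi[ident])
            for ident, a_enc in file_a_enc_i.items() if ident in file_bi]


def agency_a_decode(file_a_enc_b: list, key: bytes) -> list:
    """A'B -> AB: decoded statistical records, no identifiers left."""
    return [(decode(a_enc, key), b) for a_enc, b in file_a_enc_b]


key = b"known only to Agency A"
file_ai = {"id-001": "gpa=3.4", "id-002": "gpa=2.9"}
file_bi = {"id-001": "income=high", "id-002": "income=low",
           "id-003": "income=mid"}

file_a_enc_i = agency_a_prepare(file_ai, key)
file_ab = agency_a_decode(agency_b_merge(file_a_enc_i, file_bi), key)
```

In this sketch Agency B handles identifiers and its own data but never sees Agency A's attributes in the clear, and Agency A receives merged records that carry no identifier.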

Under optimal use of Model 2.0, Agency A has the data it requires

[...] the identity of Agency B and imposing minor constraints on the flow of
information implied by the model. Assuming that Agency A represents a single
researcher who initiates action, we can consider Agency B as: (1) a single
institution; (2) several independent institutions; (3) a specific researcher
or research group; and (4) the respondent. Each identity of Agency B
implies different administrative regulations and different reference groups
to which anonymity [...]. Their characteristics are discussed
in the following section.

Single institution: In many instances the social scientist may wish to
merge his own data with information controlled by public or private
institutions. Municipal, state and Federal agencies may, for example, maintain
demographic and medical data on individuals from whom the researcher has
already acquired data. Private agencies, including schools, medical
institutions, market research and polling organizations, may also have obtained
data of interest to the researcher. Insofar as these institutions have
formal regulations for preventing third party access to identifiable records,
the usual merge operation implied by Model 1.0 is not acceptable. In this
situation, Model 2.0 becomes a convenient device for merging the researcher's
data with institutional records without violating any institutional
regulations. That the model is feasible is evident from the social experiment
conducted by Schwartz and Orleans (1967). These researchers employed the
model in merging experimental data with IRS records on the individuals
without compromising the anonymity of the individual with respect to his
own tax record.

More subtle uses of Model 2.0 concern those institutional records
which fall into the category of public information or into the more
ambiguous area which Lister (1969) designates "pseudo-public records." In
either case, the bona fide legal difficulties which the researcher
confronts in accessing those records may be exacerbated by ambiguous
institutional regulations, by vaguely defined statutes and laws, or by
idiosyncratic enforcement of regulations, e.g., by institutional administrators.

If the researcher can anticipate such difficulties in research which
is endorsed but is also impeded by (ostensible or real) concern about
confidentiality of records, then Model 2.0 can be used to resolve the issue
and achieve research objectives.

Multiple institution case: A schematic diagram, representing the
multiple-institution variant of Model 2.0, is presented in Figure 3. It should be
evident that logistical problems become much more complex when more than
one separate institution is involved with Agency A in merge operations --
at least twice as many encode-decode operations are implied if the pattern
of Model 2.1 is used with Agency A and Agency B. Specifically, Agency A
must encode its data AI; Agency B must encode its data file A'BI and
transmit the encoded file A'B'I to Agency C.



Insert Figure 3 about here 

Encoding of files AI and A'BI is necessary in order to prevent personnel
at Agency C from interrogating identifiable records. When File CI is
merged with the encoded data, any identifiers are removed and the resulting
file, A'B'C, is returned to Agency B for partial decoding. Once Agency B
has decoded the data pertaining to File B, File A'BC is returned to Agency A
for further decoding, editing and analysis. Given ample time, funds and
accurate processing of data files, all these tasks can be performed easily.
However, I know of no good example to illustrate an actual application of
the model.




It is interesting to note that this variant of Model 2.0 provides a
kind of primitive resolution to the problems and issues implied in the
abortive proposal for a National Data Center (Dunn, 1968). Rather than
accessing all data under the auspices of one governmental agency, i.e., the
National Data Center, the independent researcher could, for example,
solicit and merge identifiable information from both the U.S. Census Bureau
and the Internal Revenue Service without violating rules for confidentiality,
by using the model. A similar variation might involve separate social research
agencies or social scientists all participating in a cooperative program
which depends on a common pool of subjects. Each agency, for example,
could maintain a unique set of data on the same subjects, or each data
file might represent one descriptive time frame or cross-section for static,
descriptive research enterprises; the total merged data file constitutes
an empirical basis for longitudinal research. The implicit assumption
here is that all agencies would cooperate in providing the resources to
implement the model or to permit outside manpower to actually merge files
under cooperative surveillance.

Independent researcher: When an independent researcher or research agency
constitutes the auspices under which Data File BI is maintained, several
kinds of constraints on Agency B's operations can make Model 2.0 a useful one.

In many circumstances, the social scientist promises the respondent
that his response will be used only for research purposes and that summary
data will be presented only in statistical form. The implication, for
many respondents at least, is likely to be that the data will be kept under
the auspices of the researcher and that identified records will not be
handled in a way which permits a possibility of disclosure of data to any
other parties.

A researcher may, however, choose to furnish identified data to a
professional colleague for research use, usually with a verbal agreement
that the colleague must not abuse or disclose identified records. This
kind of exchange is, of course, a cause for ethical and legal concern if
full confidentiality was promised initially. Should the respondent or his
representatives view this practice as a violation of confidentiality,
based on their interpretations of the original promise, then the use of
Model 2.0 may help in ameliorating ethical problems. In essence, only
statistical information is exchanged under the model, while identifiability
of records is preserved in accordance with the original promise of
confidentiality. Note that identification of membership in a sample (on
which Files A or B are based) is presumed here not to be a violation of
the promise except under extraordinary circumstances.

Respondent: Now consider the situation in which Data File B is managed
under the auspices of the respondent himself. That is, the respondent is
presumed to have some information about himself which is of interest to
the researcher. Moreover, this information must be linked with data
previously obtained from the respondent in order to maximize its utility. In
this situation, the researcher constitutes Agency A (with previously
obtained Data File AI) and the respondent constitutes Agency B with information BI.

Under ordinary circumstances, direct inquiry to B from A is a
convenient mechanism for soliciting survey information and, when these
conditions prevail, Model 2.0 is fatuous. However, there are several
situations in which Model 2.0 may become essential. Consider any inquiry to
which a response furnishes unique and socially undesirable facts about
an individual. In addition to the response's potential unreliability (or
complete absence), the question itself may, in the extreme, become illegal
according to some experts (e.g., Goldstein, 1969). There are several
alternative strategies, based on Model 2.0, which the researcher can employ
in circumventing these problems.

For example, the researcher can punch information from each of his
records into a single perforated EAM card for each individual. Some of
the card columns are left blank for the data to be solicited. The
researcher may then furnish each member of his (potential) respondent group
with a card, with instructions on its function and use, and with the
questions of interest to him. By punching out the appropriate columns and by
punching out all perforations in the identifier columns, the respondent, in
effect, merges his own data file with the researcher's while maintaining
his anonymity. The cycle implied in Model 2.0 is completed when each member
of the respondent group returns his card to the researcher.

The researcher's original data set may or may not be encoded. Decoded
information would be warranted if there was no reason to expect the
information to influence the individual's decision to respond or the substance
of his response. The decoded information, together with an explanation of
its meaning, may be essential if there is some distrust of the purposes or
methods of the researcher. On the other hand, the data should be encoded
if there is some risk of disclosure to third parties during the process of
punching cards and transmitting them to the respondent.

Note that, if the encode-decode operation is eliminated from this 
paradigm, the model is analogous to the classical mailout-mailback ques- 
tionnaire scheme when the questionnaires are mailed back anonymously. 

Utility and Corruptibility of Model 2.0 

The Campbell-Schwartz model, when employed correctly, is attractive
in several respects. Its logical basis and composition and the necessary
flows of information are all quite simple. Yet, as we have seen, the
general concept is quite flexible in that it can be generalized to a
variety of organizational situations. Furthermore, the objectives and
the steps for implementing the model are clear enough to facilitate
communication with researchers, administrators of data files, and with the
intelligent layman who expresses a reasonable apprehension about the union
of data files. These properties suggest that the model can be a reasonable
device for merging data when record identifiability in any one file must be
eliminated relative to the agency having no control of the file.

There are, however, two major potential weaknesses in the model,
which can undermine and perhaps destroy any utility it may have. The first
disadvantage is a logistical one: few agencies or individuals who are
placed in the role of Agency B may be capable of accurate match-merge
operations even when the volume of data is small. Merging large data files
can be very expensive, particularly when search and match strategies, whether
computerized or manual, are inefficient (see DuBois, 1969, for discussion
of this point). When the respondent plays the role of Agency B,
implementing the model may be very difficult because of his resistance or
indifference to the research, communications problems between researcher and
respondent, etc.

A second disadvantage involves possible corruption of the model by
Agency A or Agency B. When the encoding transform is a good one, it is
impossible for Agency B to corrupt the system unless it has access to the
decipher key or to the actual file AI. I will assume that any such access
can be prevented by the usual physical safeguards and personnel checks;
otherwise there is no real justification for encoding (Peterson and Turn,
1969, describe and evaluate these safeguards).

Agency A, on the other hand, may corrupt the model in at least three
ways: encoding duplicate identifiers, using dummy records, and merging
data sequentially. Using the first method, Agency A duplicates identifiers
in each record, producing a file AII; then, data set A and one set of
identifiers are encoded, producing Data File A'I'I. The deletion of I after
match-merging by Agency B is fatuous, since Agency A can decipher Data
File A'I'B and acquire identifiable merged records.

The second mechanism for corruption involves the use of attribute 
data as partial identifiers. If each individual's record is completely 
unique, the statistical record itself constitutes an identifier. Again, 
the deletion of formal identifiers after the match-merge process by Agency 
B is fatuous; Agency A's duplicate file of AI can be used with the unique 
statistical records to disclose the association between the formal iden- 
tifiers (I) and elements from Data File B. A variant on this method of 
corruption is also possible through sequential match-merge operations. 

That is, one can solicit sequential merges of data, using different ele- 
ments in the B file to construct a dossier on specific individuals in the 
AI file. Although time-consuming, the strategy is feasible and well-docu- 
mented by some researchers, notably Hoffman and Miller (1969). 
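
The second mechanism can be made concrete with a short Python sketch. It is an illustration only: the function name, the tuple representation of a statistical record and the sample data are all hypothetical, chosen to show how unique attribute patterns in Agency A's retained copy of AI act as identifiers for the returned file AB.

```python
# Hypothetical illustration: unique statistical records act as identifiers.
from collections import Counter


def reidentify(file_ai: dict, merged_ab: list) -> dict:
    """Link identifiers to B data wherever an A-record is unique.

    file_ai   : identifier -> tuple of A attributes (Agency A's own copy)
    merged_ab : list of (A attributes, B data) pairs with identifiers deleted
    """
    counts = Counter(file_ai.values())
    unique_a = {attrs: ident for ident, attrs in file_ai.items()
                if counts[attrs] == 1}
    return {unique_a[a]: b for a, b in merged_ab if a in unique_a}


file_ai = {"p1": (1, 2), "p2": (3, 4), "p3": (5, 6), "p4": (5, 6)}
merged_ab = [((1, 2), "x"), ((3, 4), "y"), ((5, 6), "z"), ((5, 6), "w")]

exposed = reidentify(file_ai, merged_ab)
```

Only the two respondents who share an attribute pattern remain protected in this example, which is why uniqueness of statistical records matters as much as deletion of formal identifiers.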

[...] and least amenable to easy solution. Under the assumption [...]
to develop mechanisms of inhibiting and detecting threat [...]
in the next section. An obvious device for ameliorating the logistical
problems -- shifting merge responsibility to an independent brokerage --
is discussed in the succeeding section of this paper.

More Secure Versions of Model 2.0 and Model 2.1 



For inhibiting the possibility that Agency A will subvert the purpose 
of Model 2.0, three kinds of counter-measures appear to be reasonable -- 
trusting and/or licensing the initiating agency, monitoring the merge pro- 
cess, and extending the responsibilities of Agency B to limit the access 
which Agency A has to raw data files. Of these three activities, only the 
last two can seriously be considered as counter-measures to corruption and 
only the last activity (resulting in Model 2.1) reduces physical threats
with economy.

Trust in the social researcher has been a classical basis for his
activities. This trust is often an essential element in soliciting,
maintaining and merging data on individuals over time. It appears to be
particularly necessary to the conduct and evaluation of ameliorative programs,
be the program directed toward unified groups of individuals or toward a
single person. The sociolegal formalization of this trust, or licensing,
has also been commonly employed as a mechanism for determining the
trustworthiness of a particular researcher or research agency. Insofar as trust in
Agency A or formal licensing of the agency are justified, and criteria for
[...] corruption [...].
[...] integrity and rigorous licensing requirements are often [...]
consuming; if we can discover other safeguards we may be able to eliminate
entirely the need to rely solely on the apparent integrity of Agency A.

One possible strategy for detecting and preventing corruption of the
kinds described relies on the use of monitors during the merge process.
That is, Agency B might continuously observe the conduct of the merge
and examine the physical contents of data files supplied by Agency A for
the merge. The examination of contents might be focused on the
uniqueness of each and every statistical record and to prevent
match-merges of de facto identifiable statistical records. Also, sequential
merges can be monitored so as to inhibit attempts to employ the 20
questions stratagem in building dossiers on identifiable subjects of the merged
files. Monitoring, however, may be too expensive, time consuming, or weak
to detect and prevent all but obvious attempts to corrupt Model 2.0. In
fact, it would be difficult if not impossible for a monitor to detect the
presence of encoded identifiers (i.e., Data File A'I'I supplied by Agency
A) if sophisticated enciphering techniques are used. Given these
[...] analysis of the merged data [...] the statistical summaries
[...] reduce the potential for corruption by encoding identifiers or
by de facto statistical identifiers when certain conditions are met. The
conditions are simple and depend on the kind of data condensation which is
prescribed and developed. Cross-tabulations and associated relational
statistics (e.g., chi-square and phi coefficients) can be produced under
the constraint that the observed frequencies within all cells be above a
certain number. Similarly, Agency B might require that all parametric
statistics be based on at least [...] observations within a given group.
Note that monitoring is still necessary to detect and prevent the
sequential method of corruption when frequency counts or cross-tabs are
solicited periodically by Agency A.
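
The cell-frequency constraint can be expressed as a simple release rule. The following Python sketch is an assumption-laden illustration: the function name, the `min_count` parameter and its value are hypothetical, since the text leaves the threshold to the agency supplying the summaries.

```python
# Hypothetical release rule: suppress a cross-tabulation with sparse cells.
from collections import Counter


def safe_crosstab(rows, min_count):
    """Return cell counts, or None if any observed cell is below min_count.

    rows      : iterable of (category_a, category_b) pairs from merged data
    min_count : smallest observed frequency the agency will release
                (illustrative; the source gives no specific value)
    """
    cells = Counter(rows)
    if any(n < min_count for n in cells.values()):
        return None  # releasing sparse cells could expose individuals
    return dict(cells)


released = safe_crosstab([("m", "y")] * 5 + [("f", "n")] * 5, min_count=5)
withheld = safe_crosstab([("m", "y")] * 5 + [("f", "n")], min_count=5)
```

A rule of this kind addresses only a single request; a sequence of slightly different requests still has to be watched, as the note on monitoring indicates.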

When statistical summaries, rather than raw data, are furnished,
cryptographic encoding takes on a somewhat different cast. Some of the
classical encoding mechanisms -- infinite key transforms, for example -- change
the character and mathematical properties of the record completely.
Statistics based on such transforms are meaningless. In lieu of such
key transforms, the researcher can exploit one obvious option which depends
on the kind of scales (nominal, ordinal, interval) characterizing the data
[...] of the data or [...] functions of the
data can be employed. The statistical transformations [...]
for [...] analysis of the data. Both statistical
and [...] transforms disguise rather than cryptographically encode data.
However, this strategy ought to be sufficient to inhibit overt
interrogation of data files by the broker, Agency B, and Agency A when each of
these groups monitors the merge process. In addition to on-site
monitoring, the usual precautions against interrogation of files stored
(temporarily) in a computer or EAM equipment can be used to prevent
duplication of files for later interrogation, etc. (see Peterson and Turn, 1968,
for a complete list of precautionary measures in a computer environment).

One additional safeguard can be employed by Agency B to minimize the
utility of the potentially identifiable records in File BI, in Model 2.0
or Model 2.1. Agency B can simply inoculate error into the records on
that copy of File BI which is involved in the merge. It is possible for
Agency B to control the statistical properties of the random error which
is introduced and, although the integrity of any particular record is
undermined, the statistical condensations of the merged (imperfect) data
file can be corrected for error, using common mathematical techniques
(see, for example, Cochran [1968]). Corrections may be made by Agency A
as part of data summarization in Model 2.0, and by Agencies A or B in
Model 2.1 when distributional properties of the error are known by both
agencies. For a description of the limitations of this technique, see
Boruch (1970).
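
A minimal numerical sketch of error inoculation follows. It assumes independent, zero-mean Gaussian error with a standard deviation known to both agencies, and it corrects only the variance; these choices, like all names in the code, are illustrative assumptions rather than the specific procedures of the works cited.

```python
# Illustrative sketch: add known random error, then correct a summary.
import random
import statistics


def inoculate(values, sigma, rng):
    """Agency B adds independent zero-mean Gaussian error with known sigma."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def corrected_variance(noisy_values, sigma):
    """Estimate Var(true) as Var(observed) - sigma**2.

    Valid when the error is independent of the true values.
    """
    return statistics.variance(noisy_values) - sigma ** 2


rng = random.Random(0)                      # fixed seed for repeatability
true_values = [rng.uniform(0, 100) for _ in range(20000)]
noisy = inoculate(true_values, sigma=10.0, rng=rng)

true_var = statistics.variance(true_values)
naive_var = statistics.variance(noisy)      # inflated by about sigma**2
fixed_var = corrected_variance(noisy, sigma=10.0)
```

Individual records in `noisy` are no longer trustworthy, yet the mean is essentially unchanged and the variance can be recovered once the error variance is subtracted.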
In this section, we incorporate an intermediary agency into the
formal structure of the earlier models; two principal functions of this
broker include match-merging data (Model 3.0) or maintaining code
linkage (discussed below as Model 4.0).

Model 3.0 is illustrated in Figure 5; the figure represents a direct
extension of Model 2.0, containing most of the same elements and flows
of information.

Insert Figure 5 about here 

In this model, Data File AI is generated by Agency A, and the
statistical portions of each record are encoded (i.e., AI becomes A'I).
Similarly, Agency B generates encoded Data File B'I, using a different
enciphering algorithm. The two resultant files, A'I and B'I, are
match-merged by the broker, based on the unique identification portion of each
record. Encoding, of course, protects the files against interrogation by
the broker during the merge process. Following the match-merge operation,
all identifiers are deleted and Data File A'B' is returned to Agency B
for decoding. This partially decoded file, A'B, is then sent to Agency A
for decoding, editing, and analysis.
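
The division of labor in Model 3.0 can be sketched as follows. As before, the XOR encoding, the keys, the record layout and the function names are hypothetical and serve only to show which party can read which part of a record at each step.

```python
# Illustrative sketch of Model 3.0: a broker merges two encoded files.
from itertools import cycle


def xor_hex(text: str, key: bytes) -> str:
    """Toy reversible encoding (stand-in for each agency's own cipher)."""
    return bytes(b ^ k for b, k in zip(text.encode(), cycle(key))).hex()


def unxor_hex(token: str, key: bytes) -> str:
    """Inverse of xor_hex() for the holder of the key."""
    return bytes(b ^ k
                 for b, k in zip(bytes.fromhex(token), cycle(key))).decode()


def broker_merge(file_a: dict, file_b: dict) -> list:
    """A'I + B'I -> A'B': match on clear identifiers, then delete them."""
    return [(file_a[i], file_b[i])
            for i in sorted(file_a.keys() & file_b.keys())]


key_a, key_b = b"agency-a-secret", b"agency-b-secret"
a_enc = {"id-1": xor_hex("score=71", key_a),
         "id-2": xor_hex("score=55", key_a)}
b_enc = {"id-1": xor_hex("visits=3", key_b),
         "id-2": xor_hex("visits=0", key_b)}

merged = broker_merge(a_enc, b_enc)                        # A'B'
after_b = [(a, unxor_hex(b, key_b)) for a, b in merged]    # A'B
final = [(unxor_hex(a, key_a), b) for a, b in after_b]     # AB
```

The broker in this sketch sees identifiers but no readable statistics, while Agency A ends with readable statistics and no identifiers.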

By moving responsibility for match-merging from Agency B to the
broker, we have reduced some of the technical expertise and manpower
required of Agency B, thereby ameliorating a disadvantage of Model 2.0.
A decoding operation has been added, but this is likely to be no more of a
problem for Agency B than the original encoding. If Agency B considers
this operation to be an unwarranted imposition, the agency can simply
provide Agency A with the decipher code and let Agency A decode the B'
sections of the merged file.

Although an economic problem is resolved by Model 3.0 and the
encoded data are secure against disclosure to Agency B and the broker, the
potential for corruption of the system by Agency A still has not been
reduced. Model 3.1, an obvious extension of Model 2.1, presents one
resolution of this problem. The broker, in this case, is assigned responsibility
for summarizing the data (where the summary is specified a priori by
Agency A) as well as merging the files. As in Model 2.1, monitoring is
necessary to prevent use of the 20 questions stratagem in corrupting the
system. Also, a transformation of the data and secrecy of file contents
are essential for eliminating the possibility of the broker corrupting the
system. In addition, inoculation of random error with known parameters will
help to minimize the utility of identifiable records to the broker and
to each agency.

Insert Figure 6 about here

Perhaps the best method of further inhibiting the broker's ability 
to interrogate identifiable records is to cryptographically encode the 
identifiers in each file, using an encoding scheme developed jointly by 
Agency A and Agency B. So long as the same encode system is used in 
each matching identifier, the merge can be conducted yet the possibility 
of interrogation is virtually eliminated. 
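
One modern way to realize a jointly developed identifier encoding is a keyed hash. The sketch below uses HMAC-SHA-256 from the Python standard library purely as a stand-in; the shared key, the identifier format and the function name are hypothetical and are not taken from the original text.

```python
# Illustrative sketch: both agencies encode identifiers with a shared key,
# so the broker can match records without reading the identifiers.
import hashlib
import hmac


def encode_identifier(identifier: str, shared_key: bytes) -> str:
    """Deterministic keyed encoding: equal identifiers give equal tokens."""
    return hmac.new(shared_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()


shared_key = b"agreed between Agency A and Agency B"   # withheld from broker
person = "Jane Doe 1950-01-01"                          # made-up identifier

file_a = {encode_identifier(person, shared_key): "encoded A record"}
file_b = {encode_identifier(person, shared_key): "encoded B record"}

# Broker's view: opaque tokens that still match across the two files.
matches = file_a.keys() & file_b.keys()
```

Because the broker lacks the key, it cannot test guesses about who a token belongs to, yet the merge proceeds exactly as with clear identifiers.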

Variations on Models 3.0 and 3.1 and Their Corruptibility 

Models 3.0 and 3.1 can be manipulated in the way prescribed earlier
in order to demonstrate the variety of situations to which the models are
applicable. Instead of varying the identities of Agencies A and B, however,
we may change the identity of the broker more conveniently. Three
variants are considered here: a neutral agency, respondent, or
researcher, each considered as the broker in the system.

Neutral agency: In some instances, it may be possible to engage an
agency which is relatively independent of the other agencies involved in
Models 3.0 and 3.1 and of any third parties which might attempt to
interrogate merged files. For example, a governmental agency such as the
Census Bureau can play the brokerage role when the effectiveness of the
intermediary is dependent on constitutional protection of potentially
identifiable merged files. A need for such protection is evident if the
union of files jeopardizes respondents more than separated files do, or
if the data for each separate file had been gathered initially under
statutory or constitutional protection. The use of the Census Bureau
in a more generalized brokerage role, and the use of a specially created
government agency to fulfill a similar role for social scientists, have
been discussed by Dunn (see Westin, 1965) and recommended in some
published legal opinions, e.g., in the Valparaiso Law Review (1969).

One of the problems here is that Federal agencies are not likely,
at least in the near future, to regard themselves as brokers for social
scientists who wish to merge data. Unless legislation or regulations are
created to specify that this must be one of their missions, the agencies
will probably not have the manpower, computer facilities or other
logistical support to implement Models 3.0 or 3.1.

Under these conditions, commercial service organizations might
fulfill the role of broker with dispatch and with a good deal of security
for the data. Highly confidential and secret records are processed
routinely by computer service groups, for industry and for municipal,
state, and Federal governments. When the identifiers and statistical data
are encoded by Agencies A and B and when there is strict monitoring of
the merge process (with safeguards against secret reproduction of files,
merged or otherwise), there appears to be no critical problem in using
such an agency. The agency, of course, cannot furnish statutory
protection for the files it processes, as the Census Bureau or similar variants
might be able to do.

Respondent: Suppose that Agencies A and B, be they independent researchers
or institutions, cannot agree on a choice of institutional broker. Their
unwillingness to do so may be caused by general distrust of the
candidates for brokerage or their suspicion of the model, by the expense and
logistical problems involved in implementing the model, or by the
difficulties in monitoring the merge (and perhaps statistical summarization)
process.

Under these circumstances, the individual on whom records are
maintained (i.e., the respondent) can substitute as a reasonable broker.
That is, the respondent can merge data through mailout-mailback methods
or through more controllable techniques within institutional environments,
when his record from each file is presented to him in appropriate physical
form. This strategy is analogous to the one presented earlier --
match-merging data when the respondent is identified as Agency B in Model 2.0.
As in Model 2.0, encoding-decoding operations are optional, depending on
the potential for unwarranted disclosure of information during the record's
processing and transmission.

Using the respondent as broker is inconvenient and inferior to other
strategies insofar as nonresponse rates are likely to be high and
logistical problems are serious. Moreover, any of the corruption strategies
mentioned in connection with Model 3.0 are applicable in this case. The
respondent-broker substitution is not a good one unless there are some
other guarantees that Agency A is not interested in obtaining identifiable
records. If these guarantees are absent, a fourth agency might be
introduced to the system; the agency must be dedicated entirely to
computing summaries of the data, destroying merged records, and furnishing
the summaries to Agency A under the safeguard conditions prescribed
earlier.

Researcher: Using the researcher as broker in Models 3.0 or 3.1 requires
a slightly different interpretation of the information exchanges
described earlier. Specifically, we can impose the constraint that Agency A
and Agency B are actually the respondent at two different points in
time. Rather than encoding statistical portions of the record, each
individual encodes his identification uniquely and in accordance with his
own enciphering technique. The consistent use of this alias at points A
and B in time, in conjunction with the researcher acting as broker, permits
match-merging and summarizing the data. Aliases can be constructed
systematically using a variety of instructions (see Boruch [1970]) and, so
long as the researcher lacks the ability to link aliases with true
identification, the anonymity of the respondent is protected. (Note that
flow lines in Models 3.0 and 3.1 must be adjusted so that merge,
summarization and analysis of results are conducted under the auspices of the
researcher.)
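
The alias scheme can be illustrated with a small Python sketch. The rule used here (a truncated hash of a phrase known only to the respondent) is a hypothetical example of a "private enciphering technique" and is not one of the instructions referred to in the text; all names and data are invented for the illustration.

```python
# Illustrative sketch: respondent-generated aliases let the researcher
# merge two waves of data without learning true identities.
import hashlib


def make_alias(private_phrase: str) -> str:
    """Respondent-side rule: the same private phrase yields the same alias."""
    digest = hashlib.sha256(private_phrase.encode("utf-8")).hexdigest()
    return digest[:12]


def researcher_merge(wave_a: dict, wave_b: dict) -> dict:
    """Match records from time A and time B on the alias alone."""
    return {alias: (wave_a[alias], wave_b[alias])
            for alias in wave_a.keys() & wave_b.keys()}


# Each respondent applies the rule at both points in time.
alias_1 = make_alias("first pet + street number, respondent one")
alias_2 = make_alias("favorite teacher, respondent two")

wave_a = {alias_1: "attitude=4", alias_2: "attitude=2"}
wave_b = {alias_1: "attitude=5", alias_2: "attitude=2"}

panel = researcher_merge(wave_a, wave_b)
```

The merge succeeds only if each respondent reproduces his alias exactly at both points in time, which is the practical weak point of any scheme of this kind.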

Code Linkage Systems: Model 4.0 

In some research programs, code linkages between different data
files may be maintained indefinitely for possible use in merging the
files. The justification for the linkage and the physical generation of
the linkage seem to depend in large measure on the kind of research
which is being conducted. Therefore, employment of code linkages is
discussed primarily in terms of published examples of the systems. The
basic composition of the code linkage model is given in Figure 7.

Insert Figure 7 about here 

The model is characterized by three basic elements: the two
agencies which maintain the data files and a broker to facilitate
match-merging. If we delete the broker from the system, this model becomes
closer to Model 2.0 in conceptualization; the benefit of having the
broker depends on the broker's ability to implement those processes
which Agency B cannot. The model works in the following way.

Each element of statistical data in each record of Data File A is
encoded by Agency A; identification has been previously encoded under a
different encrypting technique. Similarly, Agency B encodes its own
statistical data using a unique encoding technique; identifiers in this
data file, as in Data File A, have been encoded previously (I") using an
encoding scheme which differs from all others used in the process. The
two resulting data files, A'I' and B'I", are transmitted to the broker,
which then merges the data based on its knowledge of the linkage between encoded
identifiers (i.e., I'I"). The resultant Data File, A'B', is returned
first to Agency B for decoding and then to Agency A for further decoding
and analysis.
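
The flow just described can be illustrated with a minimal sketch. Everything in it is hypothetical: the record contents, the identifier codes, the function names, and especially the additive "cipher", which stands in for whatever encoding algorithm each agency would actually choose and offers no real security.

```python
# Hypothetical illustration of the Model 4.0 flow (toy data, toy cipher).

def shift_encode(values, key):
    """Toy reversible encoding of numeric statistical data (NOT secure)."""
    return [v + key for v in values]

def shift_decode(values, key):
    """Inverse of shift_encode for the same key."""
    return [v - key for v in values]

KEY_A, KEY_B = 17, 42  # assumed private keys of Agency A and Agency B

# File A'I': statistics encoded by A, keyed by encoded identifiers I'.
file_a = {"p1": shift_encode([3, 5], KEY_A), "p2": shift_encode([4, 1], KEY_A)}
# File B'I": statistics encoded by B, keyed by encoded identifiers I".
file_b = {"q7": shift_encode([10], KEY_B), "q9": shift_encode([20], KEY_B)}

# The broker holds only the code linkage I' -> I".
link = {"p1": "q9", "p2": "q7"}

def broker_merge(file_a, file_b, link):
    """Match records through the link and drop both encoded identifiers."""
    merged = []
    for id_a, rec_a in file_a.items():
        id_b = link.get(id_a)
        if id_b in file_b:
            merged.append((rec_a, file_b[id_b]))
    return merged

merged = broker_merge(file_a, file_b, link)                  # File A'B'
after_b = [(a, shift_decode(b, KEY_B)) for a, b in merged]   # decoded by Agency B
final = [(shift_decode(a, KEY_A), b) for a, b in after_b]    # decoded by Agency A
```

In this sketch the broker sees neither true identifiers nor decoded statistics, and the merged file carries no identifiers at all when it leaves the broker.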

This model exhibits several potential benefits over Models 3.0 and 
3.1. Protecting the records against corruption by Agency A is unnecessary 
under optimal operation of the model, since the model specifies that 
Agency A maintains only encoded identifiers in its own statistical record. 
The likelihood that the broker can decipher both encoded records and encoded
identifiers is low, suggesting that the broker can be monitored
under less stringent conditions (i.e., requiring less manpower) than in 
previously suggested models. The opportunity for third parties to pene- 
trate any files during the processes implied by the model is also minimal. 
Finally, if the code linkage is maintained under very secure auspices 
(free from third party interrogation, legal or otherwise), the routine 
maintenance of data as well as the merge process is virtually free from the
possibility of any disclosure of information. 

So far, I have not mentioned the actual mechanism for generating
the encoded identifiers and code linkages. This mechanism is crucial to 
the integrity of Model 4.0 and to its distinctiveness relative to other 
models. How might such a code linkage be generated and maintained? 

Two published descriptions of code link use are examined below, 
with special regard for the method of generating code linkages and the 
corruptibility of models implied by each description. The Manniche- 
Hayes system is an early variation, developed well before the interaction 
among social research, computerized records, and the privacy issue became 
important. A second model, exemplified by the ACE LINK FILE System, 
was created in direct response to public and professional apprehension 
about maintaining identifiable records in a longitudinal research program.

Manniche-Hayes System

Figure 8 illustrates a system developed by Manniche and Hayes which 
permits a researcher to solicit and merge information on a pool of in- 
dividuals, using two sources of data. The two sources include a broker 
who obtains information from identifiable archival records, and the 
respondent himself. The broker's function is to control solicitation of
data and to construct the code linkage system which is used by the
researcher to merge data furnished by the respondent and by the
broker.

Insert Figure 8 about here

Figure 8 is interpreted as follows. The broker compiles Data File
AI from existing identifiable records. Then, identifiers in the file
are encoded and the resulting file AI' is supplied to the researcher.
The broker simultaneously creates a file linking true identifiers with
the encoded identifiers; this dictionary file is designated II' in the
figure. Each respondent also creates two kinds of records, where identifiers
in records are encoded arbitrarily by the respondent himself.
Data File BI is then transformed to BI" and supplied to the researcher.
Each element in a second dictionary file II" is supplied to the broker
by each respondent.

The broker, having both dictionaries, II' and II", match-merges
these on the basis of common true identifiers (I) and supplies the
resulting code linkage file to the researcher. Given Data Files AI'
and BI" and the code linkage between the files, I'I", the researcher
can merge the files easily.
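
The dictionary matching and the final merge can be sketched as follows. The names, aliases, and record fields are invented for illustration, and the two function names are hypothetical; the sketch assumes each dictionary is a simple lookup table.

```python
# Hypothetical sketch of the Manniche-Hayes linkage (toy data).

# II': broker's dictionary, true identifier I -> encoded identifier I'.
dict_broker = {"Ann Lee": "a01", "Bo Diaz": "a02"}
# II": entries supplied by each respondent, I -> self-chosen alias I".
dict_respondent = {"Ann Lee": "falcon", "Bo Diaz": "willow"}

def build_link(dict_broker, dict_respondent):
    """Broker step: match on the true identifier I, keep only the pair (I', I")."""
    return {enc: dict_respondent[true_id]
            for true_id, enc in dict_broker.items()
            if true_id in dict_respondent}

link = build_link(dict_broker, dict_respondent)  # code linkage file I'I"

# Files held by the researcher; neither contains a true identifier.
file_a = {"a01": {"gpa": 3.1}, "a02": {"gpa": 3.7}}                # AI'
file_b = {"falcon": {"attitude": 4}, "willow": {"attitude": 2}}    # BI"

def researcher_merge(file_a, file_b, link):
    """Researcher step: join AI' and BI" through the link, dropping all codes."""
    merged = []
    for enc_a, rec_a in file_a.items():
        enc_b = link.get(enc_a)
        if enc_b in file_b:
            merged.append({**rec_a, **file_b[enc_b]})
    return merged

merged = researcher_merge(file_a, file_b, link)
```

Only the broker ever handles the true identifiers; the researcher works solely with the two sets of codes.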

Utility and Corruptibility of the Manniche-Hayes Model

Assuming that the broker is not corruptible, it would be difficult
if not impossible for the researcher to obtain any identifiable
records on the respondents. The usual physical safeguards and monitoring
devices can be used to inhibit overt attempts by the researcher to
corrupt the system; the absence of access to any identifiable records
makes corruption via encoded identifiers almost impossible. The 20
questions stratagem by the researcher can probably be detected by the
broker if the broker monitors the data which it supplies to the researcher
and the data supplied by the respondent.

The most obvious weakness in the system is the broker, simply be- 
cause this agency does have access to fully identifiable records in one 
file and to the complete code linkage system. If the broker is officially 
responsible for maintaining file AI, then there is no particular threat 
unless the broker has a definite interest in expanding its information 
system to include File BI; if there is little physical security for the
BI" file, the broker may gain access to it and conduct its own data merge
using its dictionary files.

The potential for collusion between researcher and broker is also 
evident. If, as Manniche and Hayes suggest, the broker is a profes- 
sional colleague of the researcher, the likelihood of collusion is 
bound to be perceived as high, regardless of its true likelihood.

In order to lower the probability of collusion, we might employ some of 
the strategies described earlier. The brokerage role can be limited to, 
say, neutral agencies which can gain nothing by collusion and may suffer 
punitive action as a result of collusion. For example, a school registrar 
might be required by administrative regulations and/or municipal law to in- 
sure that his records are never identifiable to third parties. Punitive 
action can be taken against the broker if its attempts at corruption of the 
system are detected. In this case, the Manniche-Hayes model is not substantially
different, in advantages and limitations, from the Campbell-Schwartz
model.

ACE Link File System 

One of the most interesting variations on Model 4.0 has been developed 
recently at the Office of Research of the American Council on Education 
(Astin and Boruch, 1970; Boruch, 1969). Illustrated in Figure 9, this experimental
system employs a foreign intermediary (i.e., a broker) to maintain
the code linkage file (I'I"). Data File BI represents information
gathered by the research agency at time T1, while information for Data
File AI is collected and consolidated at some later point in time, T2.

Insert Figure 9 about here 

The link file itself is created at T1 in conjunction with the transformation
of identifiers in file BI. Also at time T1 a dictionary is
created (in effect) with three kinds of data: true identifiers (I) and two
sets of encoded identifiers (I' and I"). The two encoded sets differ from
one another in physical contents and in the manner in which codes are
generated. File BI" is constructed by replacing true identifiers (I) with
one set of encoded identifiers, resulting in the file maintained on-site
by the research agency (BI"). After this operation, the dictionary is
used to construct the link file I'I" and a second dictionary II'. The
link file I'I" is then sent to the broker; the first dictionary, II'I", is
destroyed as are any researcher's copies of Data Files BI or I'I".

Later merge operations are conducted in two stages. The broker merges
Data File AI' with the link file, I'I", and deletes the set of identifiers
I'. When this file AI" is returned to the research agency, it is merged
with file BI" by the researchers on the basis of common identifiers, I".
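
The creation of the link file and the later two-stage merge might be sketched as below. This is an illustration only: the records, the random hexadecimal codes, and the helper names are assumptions, and the step in which the second dictionary II' is used to turn file AI into AI' is inferred from the description rather than stated in it.

```python
# Hypothetical sketch of link file creation and the two-stage merge.
import secrets

_used = set()

def new_code():
    """Return a random code not issued before (illustrative scheme only)."""
    while True:
        code = secrets.token_hex(4)
        if code not in _used:
            _used.add(code)
            return code

# First collection: the research agency holds BI with true identifiers.
file_bi = {"Ann Lee": {"score_1": 50}, "Bo Diaz": {"score_1": 61}}

# Dictionary II'I": true identifier I -> (I', I").
dictionary = {true_id: (new_code(), new_code()) for true_id in file_bi}

file_bi2 = {dictionary[i][1]: rec for i, rec in file_bi.items()}  # BI", kept on-site
link_file = {c1: c2 for c1, c2 in dictionary.values()}            # I'I", sent to broker
second_dict = {i: c1 for i, (c1, _) in dictionary.items()}        # II'

del dictionary, file_bi  # the first dictionary and BI are destroyed

# Later collection: AI is gathered and (by assumption) converted to AI' via II'.
file_ai = {"Ann Lee": {"score_2": 55}, "Bo Diaz": {"score_2": 70}}
file_ai1 = {second_dict[i]: rec for i, rec in file_ai.items()}    # AI'

def broker_stage(file_ai1, link_file):
    """Stage 1 (broker): replace I' with I" and delete I'."""
    return {link_file[c1]: rec for c1, rec in file_ai1.items() if c1 in link_file}

file_ai2 = broker_stage(file_ai1, link_file)                      # AI"

# Stage 2 (research agency): merge AI" with BI" on the common code I".
merged = [{**file_bi2[c2], **rec} for c2, rec in file_ai2.items() if c2 in file_bi2]
```

After the `del` statement, nothing held by the research agency connects I" to a true identifier; the sketch cannot, of course, show whether a real agency actually destroys its copies.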

Utility and Corruptibility of the Link File System

When the model is adhered to rigorously, the Link File System demonstrates
some important ways for preventing interrogation of identifiable
records during the merge process. Data File BI" is virtually free from
penetration even by Office of Research staff, since identifiers are encoded
and the decipher key (II'I") has been destroyed. Similarly, the process of
merging AI' with BI" is free from threat of the broker's penetration since
only encoded identifiers in File AI' are supplied to the broker. The
physical merge process appears to be quite safe because the researchers
themselves cannot decipher the encoded identifiers in AI" and BI". A final
benefit is that data file BI" (and all succeeding data files) can be maintained
without risk of extra-legal or legal interrogation of files. True
identifiers are legally inaccessible by subpoena, if the broker is a foreign
agency and if the agreement between broker and researchers specifies that
the linkage be kept secret and secure, even from the researchers themselves.

These and other advantages described by Astin and Boruch (1970) are
impressive. However, this model is vulnerable to some of the same corruption
strategies mentioned in the context of Model 2.0. The problems described
below are based on a few of the writer's own perceptions, and on two very
professional critiques supplied by Dr. Rein Turn of Rand Corporation and Dr.
Lance Hoffman of University of California at Berkeley (both personal
communications).

Suppose we consider possible corruption of the system by members of the
research agency. First, there is no real guarantee that the agency actually
destroys copies of files BI or the code linkage I'I"; given file AI and
such a retained copy, of course, completely identifiable records (of the form ABI) can be
constructed, subverting the purpose of the system.

Actually, covert duplication and maintenance of files BI and I'I" by a
member of the research agency or the failure to destroy original files at
the appropriate time is not really necessary to permit later interrogation
of identifiable records. One need only construct a dummy variable in each
and every record of File BI"; the dummy must contain covertly encoded true
identifiers and link file characters. This strategy is a simple extension
of one mentioned in connection with Model 2.0: encoding identifiers.

The brokerage agency constitutes a second potentially weak element in
the system. The 20 questions stratagem can be employed here to corrupt the
system. In this case, the broker's objective may be to construct identifiable
records corresponding to Data File AI. If the broker has access to
the list of individuals whose records are maintained, it may then construct
its own file of commonly available data about those individuals. Given
these data, its copy of Data File AI', and the documentation for the file,
the broker may be able to interrogate the file and build its own dossiers,
using the 20 questions stratagem. This would be particularly easy to do
with relatively small numbers of individuals and a large number of elements
in each record. One convenient way to ameliorate this difficulty has been
suggested by Lance Hoffman: The researchers must encode the statistical
portion of each record (that is, A is transformed to A") using a unique
encoding scheme which is unavailable to the broker.
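
A small sketch may make the 20 questions stratagem concrete. It assumes the broker holds an encoded file whose statistical elements are readable, plus its own list of commonly available facts about the same individuals; the field names, records, and the function `reidentify` are all hypothetical.

```python
# Hypothetical sketch of the 20 questions stratagem (toy data).
from collections import defaultdict

def reidentify(encoded_records, public_records, keys):
    """Map encoded ids to names wherever the attribute combination named by
    `keys` is unique in both the encoded file and the public file."""
    def index(records, id_field):
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[k] for k in keys)].append(rec[id_field])
        return groups

    enc = index(encoded_records, "code")
    pub = index(public_records, "name")
    return {ids[0]: pub[combo][0]
            for combo, ids in enc.items()
            if len(ids) == 1 and len(pub.get(combo, [])) == 1}

# Broker's copy of the encoded file (statistical elements left in the clear).
encoded = [
    {"code": "x1", "sex": "F", "year": 1950, "major": "math", "income": 9},
    {"code": "x2", "sex": "M", "year": 1950, "major": "math", "income": 7},
    {"code": "x3", "sex": "M", "year": 1951, "major": "art", "income": 5},
    {"code": "x4", "sex": "M", "year": 1951, "major": "art", "income": 6},
]
# Broker's own file of commonly available data.
public = [
    {"name": "Ann Lee", "sex": "F", "year": 1950, "major": "math"},
    {"name": "Bo Diaz", "sex": "M", "year": 1950, "major": "math"},
    {"name": "Cy Orr", "sex": "M", "year": 1951, "major": "art"},
    {"name": "Di Poe", "sex": "M", "year": 1951, "major": "art"},
]

one_question = reidentify(encoded, public, ["sex"])
three_questions = reidentify(encoded, public, ["sex", "year", "major"])
```

With one attribute only a single record is exposed; with three, half the file is. If the statistical portion were itself encoded under a scheme unknown to the broker, as Hoffman suggests, the attribute values in the two files would no longer be comparable and the match would fail.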

Dr. Turn has emphasized the weaknesses of foreign brokerage as opposed
to domestic maintenance of link files. He contends that one objective of the
system, keeping link files secure from legal penetration, would not be
met if certain plausible events occurred. In such an event, foreign courts
may submit quite readily to our government's request for the linkages. Normal
international regulations may be quite unnecessary, if informal disclosure
of files is perceived as being a friendly understanding between governments
or as an amicable political gesture.

If the foreign agency chooses not to abide by its contract to maintain 
the link file (or if it decides to sell the information), then the system's
functional utility is destroyed. Moreover, successful prosecution of the
broker may be so difficult and time consuming that the system's utility
would be impaired considerably, if not destroyed entirely.

These kinds of weaknesses in brokerage are improbable (although still 
possible), if the broker is selected carefully, and if there are some external 
guarantees of adherence to the model.

In the ACE system, one such guarantee is the ACE agreement to provide
exactly the same link file services to researchers at the foreign agency.
If the foreign broker ignores its own responsibility toward ACE, then presumably,
ACE can make similar reprisals. This kind of countermeasure is not
particularly appealing (if only because it is so destructive) but it may be 
a useful mechanism for deterring violation of formal contract or informal 
agreements. 

Variations on the Manniche-Hayes and ACE Systems: Relation to Earlier Models 

Both Manniche-Hayes and ACE Link File Systems were developed with a 
specific purpose different from the function of models considered earlier; 
the reader will recall that Models 2.0-3.0 were dedicated to preventing disclosure
of one file used in a merge operation. On the other hand, the
Manniche-Hayes paradigm eliminates the need for the researcher to maintain 
any identifiable record for any length of time. The ACE System limits the 
maintenance of identifiable records to short periods of time (i.e., during 
the period a link file is created). 

Both models can, with minor adjustments, be treated as variations of the
early models; contrariwise, building on the earlier models results in
systems that provide many of the same services that the Manniche-Hayes and
Link File Systems provide. In the first place, one could adjust the Manniche-Hayes
approach to permit the researcher full access to one set of identifiable
records; this eliminates the need for the link file and makes this situation
identical to Model 3.0 in function, benefits, and shortcomings. Building
from the earlier models, specifically the multiple institution (or time)
variant of Model 3.0, we have a situation which is identical to the longitudinally
operated Link File, except for the maintenance of a code linkage
(and associated benefits and shortcomings of the linkage strategy).

Both the Manniche-Hayes and Link File Systems can be manipulated in
much the same manner as earlier models. Research agencies, formal institutions,
or the respondent himself can be used to complement the researcher
and broker in each model. In each model, the broker may be manipulated;
when the respondent himself is used as a broker, the Link File System is
quite similar in form and function to the situation in which the respondent
plays the same role in Model 3.0.

Difficulty in Applying the Models and Consequences of Their Use

Three kinds of problems -- technical, contextual and logistical -- are
inherent in any implementation of the models described here. At the core of 
technical problems is the need for encoding alphanumeric information in each 
one of the models. Techniques for cryptographic encoding are likely to be
unfamiliar to most social scientists, computer scientists or managers of
data files. Moreover, there appear to be no standardized criteria for appraising
the adequacy, efficiency and costs of the techniques currently employed
by commercial and military organizations. Although informal guidelines are 
currently available, it is likely that the nature of privacy transformations 
and their effectiveness will change considerably in the near future as the 
algorithms used in code generation are more closely linked with computer
control systems developments and microminiature circuitry advances (Taylor 
and Feingold, 1970). Under these circumstances, the social scientist who 
wishes to employ one of the models must learn to develop his own encode-de- 
code systems based on existing information. A brief description of encode 
techniques and a selected bibliography is provided in Appendix I. 

The second problem, a contextual one, involves appraisal of the need 
for a model and of potential corruptibility. Need is obviously a function 
of the nature of the data being merged and the interest of a participating 
agency or some third party in gaining access to identifiable records. These 
factors are not easy to evaluate themselves, much less with respect to the 
costs of employing one of the models and ancillary safeguards. One examina- 
tion of this issue is given by Boruch (1970), but much more systematic and 
empirical exploration is needed. The comments made earlier
on shortcomings of the models represent only one kind of appraisal technique,
based essentially on examination of important elements in the models' information
flow. Even in this context certain kinds of corruptibility have been ignored,
e.g., collusion among agency personnel. Other methods for appraisal
can be developed and these may be much more effective insofar as they permit detection
of attempts to corrupt the systems, and insofar as they furnish us with meaningful
quantitative indices of the risk of corruption. Taylor and Feingold
(1970) present an approach to quantifying the feasibility and utility of certain
safeguards which function as countermeasures to corruption of computerized
record systems. Still another approach involves the creation of prototype
systems, coupled with a devil's advocate group whose function it is to penetrate
the systems. At MIT, for example, students play this role in effect,
when they succeed in entering a "secure" resource-sharing system, without authority
(knowledge of passwords, etc.) in order to get their homework done. The appointment
of a devil's advocate as a formal position in a secure data environment has
been suggested by a number of computer experts and social scientists.

A third major problem concerns the accuracy and the ability to manipulate data
files. The researcher who merely assumes that the institutional files (or
his own files) being used in the merge are accurate is likely to be disappointed. We
know that administrative records are subjected to distortion in a variety
of ways and that documentation on accuracy is, as a rule, absent (see, for
example, Campbell, 1969).

If the data are known to be accurate, however, a second problem arises --
overload in demands on institution data files. Since the number of data banks
is small relative to the number of available respondents, at least, and relative
to the number of social scientists, the risk of swamping institutions
with requests to match-merge data is high. Without a formal (expensive)
mechanism to meet a high demand, few projects are likely to be completed.
Unless researchers are willing to pay for personnel and machine time used on
the project, as well as overhead and service charges, official cooperation
by institutions cannot reasonably be expected.

Assuming that these problems can be solved at least partially, we can 
anticipate certain benefits from wide-spread use of models by the social 
research community. Acting on the recommendations made by Miller (1970), I
will try to list the important implications of the methodology presented 
here and to evaluate them relative to a more general reference system. 

The most obvious useful result is the enhancement of the social researcher's
ability to obtain and analyze data without infringing on the privacy of the
individual. Expansion of the pool of data -- in kind, magnitude, and quality --
is perhaps one of the more useful benefits to the social science enterprise.
The conduct of research will, in some instances, be rendered much more
economical and efficient: There are fewer political and administrative problems
in collecting the data and the cost of merging the data is negligible
by comparison to the cost of actually soliciting and obtaining it through
a formal survey.

The availability of these models may stimulate more secondary analyses 
of the data -- another economic benefit for the researcher, funding organi- 
zations and, hopefully, society. In addition, the data may be of sizeable 
volume and stable enough to permit cheap replication, an opportunity which 
cannot be considered trivial in the social sciences. 

A more generalized benefit concerns the need for explaining science 
to the public, where "public" means institutional administrations. The 
cooperation between administrators and researchers, their information ex- 
changes, and the benefits which both groups derive from this cooperation 
may contribute substantially to the integrity and to the development of 
social science.
Footnotes



1 Supported by NIMH Grant 1 R12 MH17,084-03. I should like to
thank Eli Rubenstein, D. T. Campbell, A. W. Astin and A. E. Bayer for
providing advice or criticism on earlier drafts of the paper. However, views
expressed in this paper do not necessarily reflect this advice nor
the views of the sponsoring agency.

2 For excellent discussions of the current legal and professional
restrictions on accessibility of a variety of organizational files, see 
Wheeler (1969). 

3 In order to appraise validity of the sample in each case where
individual subjects volunteer to respond, a post-card for each subject can 
be constructed containing only statistical information. Return of the post- 
card by the subject indicates that the subject responded to the inquiry 
and returned this response under separate cover. 

4 In some cases, the agency with responsibility for summarizing
the data may have the computing facilities necessary for sophisticated ad
hoc data condensations, e.g., covariance-correlation matrices, nth order
statistics, etc. More typically, however, this capability is likely to be
absent. One potentially useful strategy, suited for these conditions, involves
micro-aggregation of data, where the kind and degree of aggregation
is fixed by policy and limits of computing facilities. Sample statistics
(e.g., means) are then supplied for groups, rather than individual subjects,
and the size and kind of group must be specified a priori for maximum
efficiency. Although micro-aggregation techniques are still at a primitive
stage of development and generally lead to inefficient estimates of parameters,
the techniques do appear to be generating interest and research
simply because they are a convenient device for preserving anonymity of
records (see Feige and Watts, 1970).
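
As a rough illustration of micro-aggregation, the sketch below releases group means in place of individual values. Grouping by sorted value, the minimum group size of three, and the function name are assumptions made for the example; they are not taken from Feige and Watts.

```python
# Hypothetical sketch of micro-aggregation with a fixed minimum group size.

def micro_aggregate(values, group_size):
    """Sort values, form groups of at least `group_size` members, and
    return a (group mean, group count) pair for each group."""
    ordered = sorted(values)
    groups = [ordered[i:i + group_size]
              for i in range(0, len(ordered), group_size)]
    # Fold an undersized final group into its neighbor so no group is too small.
    if len(groups) > 1 and len(groups[-1]) < group_size:
        groups[-2].extend(groups.pop())
    return [(sum(g) / len(g), len(g)) for g in groups]

incomes = [12, 15, 11, 30, 28, 33, 50]
released = micro_aggregate(incomes, 3)
```

Only the means and counts in `released` would leave the agency; no individual value does, which is the anonymity benefit the footnote describes, bought at the cost of less efficient estimates.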

5 Price-Waterhouse (New York) fulfills such a brokerage role for 
the Board of Medical Examiners; Agency A corresponds to the Board and Agency 
B corresponds to a Medical School aspirant who participates in an experi- 
mental testing program. 

6 Numeric aliases, created by the subject on the basis of prescribed
formula, have been used by Professors Peter Rossi and Eugene Groves in
mailout-mailback surveys of college students. Problems in minimizing replication
of numbers in such a group suggest that simple alphabetic aliases
may function at least as well; with Dr. John Creager, this writer has
successfully used subject-created alias names in studies of the same kind
of population.